Author

Kevin Linares

Published

December 3, 2024

Scraping civilian casualty incidents from Airwars

___________________________________________________________________________


Computing Resources:

This project was created on a Lenovo Legion 7i laptop with an i9-14900HX chip, 64 GB of DDR5 RAM, and an Nvidia RTX 4070 GPU with 8 GB of GDDR6. The operating system initially used was Ubuntu 24.10. However, when we attempted to configure our GPU to process text data for the language model, we learned that this version of Ubuntu ships the newest kernel, which updates nvidia-cli and the CUDA drivers to versions that are not compatible with TensorFlow or PyTorch, both of which the text package needs. We moved to Ubuntu 24.04 LTS within WSL2 on Windows 11 and were able to configure the GPU. Because Airwars also goes back and corrects archived incidents, it is easier to rerun the full process on all available incident records when needed, and the GPU cuts the processing time: running the language model on the GPU reduced it from 2.5 hours to about 20 minutes.

We use r-base 4.4.1 from Anaconda and RStudio 2024.04.02. We have also used these same packages on RStudio Server via WSL2 but prefer to isolate the computing environment.


We store all of our data in a SQLite database that can also be found in our GitHub repository.


Structure of SQLite database

The scraped data comprise over 800 unique events stored in two tables of a SQLite database; a third table holds the MoH daily casualty counts.

  • Table 1 (airwars_meta) contains incident metadata (e.g., unique id, incident date, web-page URL).

  • Table 2 (airwars_incidents) stores the specific incident information: the number of deaths, the breakdown of deaths (children, adults), the type of attack, the cause of death, the incident coordinates and results from Nominatim, and sentiment scores for seven emotional states.

  • Table 3 (daily_casualties) contains the Hamas Ministry of Health (MoH) daily casualties.

  • The first two tables relate to each other through the unique incident identification numbers provided by Airwars. We relate the MoH table to the Airwars tables by aggregating the incidents up to the date.

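# packages used by the code in this document
library(DBI)        # dbConnect(), dbListTables()
library(dplyr)      # tbl(), mutate(), select(); needs dbplyr installed
library(lubridate)  # as_date()
library(stringr)    # str_trunc()
library(knitr)      # kable()

# connect to database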
mydb <- dbConnect(
  RSQLite::SQLite(), 
  "~/repos/airwars_scraping_project/database/airwars_db.sqlite")

dbListTables(mydb) # print tables in database
[1] "airwars_incidents" "airwars_meta"      "daily_casualties" 
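
The relationships between the three tables can be sketched with toy data. This is a minimal illustration in base R, not the project's code: the ids and the Airwars death counts are made up, and only the column names (Incident_id, Incident_Date, killed, name, value) come from the tables shown in this document.

```r
# Toy stand-ins for the three tables (ids and Airwars counts are illustrative).
airwars_meta <- data.frame(
  Incident_id   = c("a1", "a2", "a3"),
  Incident_Date = as.Date(c("2023-10-07", "2023-10-07", "2023-10-08"))
)
airwars_incidents <- data.frame(
  Incident_id = c("a1", "a2", "a3"),
  killed      = c(1, 15, 5)
)
daily_casualties <- data.frame(
  Incident_Date = as.Date(c("2023-10-07", "2023-10-08")),
  name          = "Total",
  value         = c(232, 370)
)

# Metadata and incident details share the Airwars incident id.
incidents <- merge(airwars_meta, airwars_incidents, by = "Incident_id")

# Aggregate Airwars deaths up to the day, then line them up with the MoH counts.
airwars_daily <- aggregate(killed ~ Incident_Date, data = incidents, FUN = sum)
compare <- merge(airwars_daily, daily_casualties, by = "Incident_Date")
print(compare)
```

The same two joins could be written with dplyr's inner_join() and summarise(); base R is used here only so the sketch runs without extra packages.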


Scraping Airwars Civilian Casualty Incidents

  • The image below is an example of the Airwars incident metadata, which are presented as baseball cards. All of this information sits on one web page, and we start our workflow here. We read the main Airwars page that houses these cards and scrape only the Incident Date and Incident ID, which we use to build the incident-specific URLs that we later scrape for content.
    • All of the code for scraping and processing these data is in our GitHub repository under code/scrape_process_incidence. This code has been optimized and takes about 30 minutes on our laptop (32 GB of RAM is sufficient) with a fast internet connection.
    • Here we only explain how we pre-processed the data in preparation for analysis.
    • For much of the scraping we used SelectorGadget to get the XPath and passed it to the rvest package.
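
Building an incident URL from the Incident ID and Incident Date can be sketched as follows. The function name build_incident_url is hypothetical, and the URL pattern is inferred from the links shown in the metadata table (id, lower-case month name, day without a leading zero, year); the project's actual code may construct the links differently.

```r
# Hypothetical helper: assemble an Airwars incident URL from id and date.
build_incident_url <- function(incident_id, incident_date) {
  incident_date <- as.Date(incident_date)
  # month.name is locale-independent, unlike format(date, "%B")
  month <- tolower(month.name[as.integer(format(incident_date, "%m"))])
  day   <- as.integer(format(incident_date, "%d"))  # drops the leading zero
  year  <- format(incident_date, "%Y")
  sprintf("https://airwars.org/civilian-casualties/%s-%s-%d-%s/",
          tolower(incident_id), month, day, year)
}

build_incident_url("ispt0019a", "2023-10-07")
```

The function is vectorized, so it can be applied to whole columns of ids and dates at once.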

Example of Incident Metadata


  • Metadata table: We scrape the main Airwars website and parse the information we need to build a table containing each incident’s web URL (over 800 URLs), as seen in the example below.
# read in data tables
airwars_meta <- tbl(mydb, "airwars_meta") |> 
  as_tibble() |> 
  # convert Incident_Date to date format
  mutate(Incident_Date = as_date(Incident_Date)) |> 
  arrange(Incident_Date)

airwars_meta |> head() |> kable()
| Incident_Date | Incident_id | link |
| --- | --- | --- |
| 2023-10-07 | ispt0019a | https://airwars.org/civilian-casualties/ispt0019a-october-7-2023/ |
| 2023-10-07 | ispt0019 | https://airwars.org/civilian-casualties/ispt0019-october-7-2023/ |
| 2023-10-07 | ispt0017 | https://airwars.org/civilian-casualties/ispt0017-october-7-2023/ |
| 2023-10-07 | ispt0011 | https://airwars.org/civilian-casualties/ispt0011-october-7-2023/ |
| 2023-10-07 | ispt0010 | https://airwars.org/civilian-casualties/ispt0010-october-7-2023/ |
| 2023-10-07 | ispt0003 | https://airwars.org/civilian-casualties/ispt0003-october-7-2023/ |
  • Using the web URLs we built in the metadata table, we loop through them and scrape each page (see the example below) to parse the incident assessments.
    • Each incident contains an assessment section detailing what transpired during the incident, who was known to be involved, and the victims it produced. We use this text later to compute emotion scores.

Example of Airwars Incident

  • Incident table: Our final table contains the fields that Airwars populates for each incident. Besides parsing this information we also had to process it. Specifically, some fields contain ranges of civilians killed (e.g., 3–5) or counts by group (e.g., 1 child, 3 women, 1 man), and we split these strings into their own columns. This allows us to estimate how many children and women have been reported as civilian casualties. Our data contains 24 variables with 804 incidents reported by Airwars.
airwars_incidents <- tbl(mydb, "airwars_incidents") |> 
  as_tibble() 

airwars_incidents |> 
  head() |> 
  select(-assessment:-surprise) |> 
  kable()
| Incident_id | Strike status | Strike type | Civilian infrastructure | Civilian harm reported | Civilians reported killed | Civilians reported injured | Cause of injury / death | Airwars civilian harm grading | Impact | Suspected belligerent | min_killed | max_killed | casualty_estimate | killed | location_meta | incident_lat | incident_long | target_type | target_address_type | lat_min | lat_max | long_min | long_max | Suspected belligerents | Known belligerent | Suspected target | Known target | Causes of injury / death | children_killed | women_killed | men_killed | Civilian_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ispt0001 | Single source claim | Airstrike and/or Artillery | Healthcare facility | Yes | 1 | 2 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | 1 | 1 | absolute | 1 | Rafah the Gaza Strip | 31.296628 | 34.244689 | tertiary | road | 31.29471 | 31.29727 | 34.24404 | 34.24777 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man1 healthcare_personnel) |
| ispt00012 | Likely strike | Airstrike | Healthcare facility | Yes | 1 | 4 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | 1 | 1 | absolute | 1 | Nasser medical complex Khan Younis the Gaza Strip | 31.347002 | 34.292327 | hospital | amenity | 31.34554 | 31.34852 | 34.29053 | 34.29406 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man1 healthcare_personnel) |
| ispt00013 | Likely strike | Airstrike | Healthcare facility | Yes | Unknown | 5 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | NA | NA | NA | NA | Nasser medical complex Khan Younis the Gaza Strip | 31.347153 | 34.292774 | hospital | amenity | 31.34554 | 31.34852 | 34.29053 | 34.29406 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ispt0002 | Likely strike | Airstrike | Residential building | Yes | 3 – 7 | 4 | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 3 | 7 | range | 5 | Khuza a Khan Younis the Gaza Strip | 31.522567 | 34.462869 | tertiary | road | 31.52175 | 31.53208 | 34.46157 | 34.47451 | NA | NA | NA | NA | NA | 2 | 3 | 1 | (2 children3 women1 man) |
| ispt0003 | Likely strike | Airstrike | NA | Yes | 1 | 3 | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 1 | 1 | absolute | 1 | Al Baraa Mosque Gaza the Gaza Strip | 31.501846 | 34.437058 | place_of_worship | amenity | 31.50165 | 31.50191 | 34.43679 | 34.43723 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man) |
| ispt0004 | Likely strike | Airstrike | Residential building | Yes | 15 | NA | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 15 | 15 | absolute | 15 | home of the Al Dous family in Al Zaytoun neighborhood south of Gaza City Gaza the Gaza Strip | 31.485479 | 34.444885 | residential | road | 31.48442 | 31.48662 | 34.44390 | 34.44632 | NA | NA | NA | NA | NA | 7 | 3 | 5 | (7 children3 women5 men) |
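
Splitting the range and count strings into their own columns can be sketched in base R as follows. The helper names parse_killed and count_group are hypothetical, and taking the rounded midpoint of a range as the killed estimate is an assumption that happens to match the "3 – 7" row above (killed = 5); the project's own rule may differ.

```r
# Hypothetical helper: turn "3 – 7", "15" or "Unknown" into numeric columns.
parse_killed <- function(x) {
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+", x))[[1]])
  if (length(nums) == 0) {  # e.g. "Unknown"
    return(data.frame(min_killed = NA_real_, max_killed = NA_real_,
                      casualty_estimate = NA_character_, killed = NA_real_))
  }
  lo <- min(nums)
  hi <- max(nums)
  data.frame(min_killed = lo, max_killed = hi,
             casualty_estimate = if (lo == hi) "absolute" else "range",
             killed = round((lo + hi) / 2))  # assumption: midpoint of the range
}

# Hypothetical helper: pull the number that precedes a group word.
count_group <- function(x, pattern) {
  m <- regmatches(x, regexpr(paste0("[0-9]+(?= ", pattern, ")"), x, perl = TRUE))
  if (length(m) == 0) NA_real_ else as.numeric(m)
}

parse_killed("3 \u2013 7")  # "\u2013" is the en dash Airwars uses in ranges
breakdown <- "1 child, 3 women, 1 man"
c(children_killed = count_group(breakdown, "child"),
  women_killed    = count_group(breakdown, "wom[ae]n"),
  men_killed      = count_group(breakdown, "m[ae]n"))
```

Both helpers take a single string; over a whole column they would be applied row by row, for example with lapply().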


MoH Daily Casualties

Palestine Dataset publishes daily Gaza casualty counts taken from the Hamas MoH; however, these counts do not distinguish between civilians and militants, so they should be higher than what we derive from Airwars.1

  • We use the Palestine API at https://data.techforpalestine.org/api/v2/casualties_daily.json and parse the JSON, which we save into our database after a little data wrangling.
tbl(mydb, "daily_casualties") |> 
  as_tibble() |> 
  mutate(Incident_Date = lubridate::as_date(Incident_Date)) |> 
  head() |> 
  kable()
| Incident_Date | name | value |
| --- | --- | --- |
| 2023-10-07 | Children | 0 |
| 2023-10-07 | Women | 0 |
| 2023-10-07 | Total | 232 |
| 2023-10-08 | Children | 78 |
| 2023-10-08 | Women | 41 |
| 2023-10-08 | Total | 370 |
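
The long name/value layout above is the kind of result a wide-to-long reshape produces. The sketch below assumes the parsed JSON has already been reduced to one row per day with columns Children, Women and Total; that wide layout and its column names are assumptions for illustration, not the API's actual field names. The values are the ones printed above.

```r
# Assumed wide layout after parsing the JSON (column names are illustrative).
wide <- data.frame(
  Incident_Date = c("2023-10-07", "2023-10-08"),
  Children      = c(0, 78),
  Women         = c(0, 41),
  Total         = c(232, 370)
)

measures <- c("Children", "Women", "Total")

# One row per date and measure, with the measure label in `name`.
long <- reshape(wide, direction = "long",
                idvar = "Incident_Date",
                varying = measures, v.names = "value",
                timevar = "name", times = measures)

# Order by date, then by measure, and drop reshape()'s row names.
long <- long[order(long$Incident_Date, match(long$name, measures)), ]
rownames(long) <- NULL
print(long)
```

With tidyr available, pivot_longer() would do the same reshape in one call.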


Enriching Data with Reverse geocoding

  • When possible, Airwars includes the coordinates of where the incident took place. Although this information is contained within the assessment, Airwars standardizes its location under a “Geolocation notes” heading, from which we were able to parse the latitude and longitude to use for geographic plotting. Of the 804 incidents, about 65% contain geographic coordinates.

    • For incidents that contain coordinates, we used the Nominatim OpenStreetMap API to reverse geocode them and bring back the type of location that was targeted. We also save a bounding box for each location.
    airwars_incidents |> 
      select(target_type, contains("lat"), contains("long")) |> 
      head() |> 
      kable()
    | target_type | incident_lat | lat_min | lat_max | incident_long | long_min | long_max |
    | --- | --- | --- | --- | --- | --- | --- |
    | tertiary | 31.296628 | 31.29471 | 31.29727 | 34.244689 | 34.24404 | 34.24777 |
    | hospital | 31.347002 | 31.34554 | 31.34852 | 34.292327 | 34.29053 | 34.29406 |
    | hospital | 31.347153 | 31.34554 | 31.34852 | 34.292774 | 34.29053 | 34.29406 |
    | tertiary | 31.522567 | 31.52175 | 31.53208 | 34.462869 | 34.46157 | 34.47451 |
    | place_of_worship | 31.501846 | 31.50165 | 31.50191 | 34.437058 | 34.43679 | 34.43723 |
    | residential | 31.485479 | 31.48442 | 31.48662 | 34.444885 | 34.44390 | 34.44632 |
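
A reverse-geocoding request to Nominatim can be sketched as follows. The helper names are hypothetical and the sketch only builds the request URL and unpacks a bounding box; it does not call the service. It assumes the public endpoint with format=jsonv2, whose response carries a boundingbox of four strings ordered as minimum latitude, maximum latitude, minimum longitude, maximum longitude. Mapping the response's type and addresstype fields to target_type and target_address_type is a guess based on the values shown above.

```r
# Hypothetical helper: URL for a Nominatim reverse-geocoding request.
nominatim_reverse_url <- function(lat, lon) {
  sprintf(
    "https://nominatim.openstreetmap.org/reverse?format=jsonv2&lat=%.6f&lon=%.6f",
    lat, lon
  )
}

# Hypothetical helper: name the four bounding-box values like our columns.
unpack_bbox <- function(boundingbox) {
  bb <- as.numeric(boundingbox)
  c(lat_min = bb[1], lat_max = bb[2], long_min = bb[3], long_max = bb[4])
}

nominatim_reverse_url(31.296628, 34.244689)
unpack_bbox(c("31.29471", "31.29727", "34.24404", "34.24777"))
```

When looping over several hundred incidents against the public server, its usage policy applies: at most one request per second and an identifying User-Agent.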


Sentiment Analysis

  • After trying several text classification models and some question/context models, we landed on j-hartmann/emotion-english-distilroberta-base because it goes beyond a positive/negative evaluation and analyzes text for Ekman’s 6 basic emotions, which are common in psychological work on emotions, plus a neutral class. Moreover, this model lets us examine the emotional tone of these assessments over time.2

    • We get a score for each emotion; the closer to 1, the stronger the association, and the scores add up to 1.

    • Given that we have over 800 assessments, we decided to use the text package,3 which lets us use a laptop GPU (RTX 4070)4 to run the model on each incident. This greatly reduced the processing time.

    • Below we print an example of these scores, truncating the assessment text.

airwars_incidents |> 
  slice_sample(n=1) |> 
  select(assessment:surprise) |> 
  mutate(assessment = str_trunc(assessment, 200),
         across(where(is.double), ~ round(.x, 2))) |> 
  kable()
| assessment | anger | disgust | fear | joy | neutral | sadness | surprise |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Shortly before 6pm on Sunday, 29th October 2023, at least 35 civilians, including at least eight women and 16 children, were reportedly killed and dozens were injured in an alleged Israeli airstrik… | 0.02 | 0 | 0.2 | 0 | 0.01 | 0.74 | 0.01 |
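
Scores that sum to 1 across the seven classes are what a softmax over the model's raw outputs produces. The sketch below is illustrative only: the logits are made up, and the softmax function is written out by hand rather than taken from the text package. The second half uses the rounded scores printed above to pick the dominant emotion.

```r
# Numerically stable softmax: subtract the maximum before exponentiating.
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Made-up logits for the seven classes, for illustration only.
logits <- c(anger = -1.2, disgust = -3.0, fear = 1.1, joy = -3.5,
            neutral = -2.0, sadness = 2.4, surprise = -1.9)
probs <- softmax(logits)
sum(probs)  # equals 1 up to floating-point error

# Rounded scores from the example incident printed above.
scores <- c(anger = 0.02, disgust = 0, fear = 0.2, joy = 0,
            neutral = 0.01, sadness = 0.74, surprise = 0.01)
sum(scores)                       # slightly below 1 because of rounding
names(scores)[which.max(scores)]  # dominant emotion for this assessment
```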

Footnotes

  1. Note. Confidence is low to moderate since the data come from the Hamas MoH.

  2. The model is trained on a balanced subset of the datasets listed on its model card (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).

  3. An R package for analyzing natural language with transformers from HuggingFace using Natural Language Processing and Machine Learning.

  4. The installation of text is tricky, as the right Python libraries must be installed. To run models on the GPU, we learned that the Nvidia CUDA drivers must be installed for version 12.1. Additionally, we could only get this to work via Anaconda within Ubuntu 24.04 installed through WSL2 on Windows 11. Ubuntu 24.10 comes with a kernel that forces CUDA 12.8 to be installed, which did not work for us in a dual-boot system.